{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%reload_ext autoreload\n",
    "%autoreload 2\n",
    "%matplotlib inline\n",
    "import os\n",
    "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
    "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "using Keras version: 2.2.4\n"
     ]
    }
   ],
   "source": [
    "import ktrain\n",
    "from ktrain import text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building a Chinese-Language Sentiment Analyzer\n",
    "\n",
    "In this notebook, we will build a Chinese-language text classification model in 3 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.\n",
    "\n",
    "The dataset can be downloaded from Chengwei Zhang's GitHub repository [here](https://github.com/Tony607/Chinese_sentiment_analysis/tree/master/data/ChnSentiCorp_htl_ba_6000).\n",
    "\n",
    "(**Disclaimer:** I don't speak Chinese. Please forgive mistakes.)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 1:  Load and Preprocess the Data\n",
    "\n",
    "First, we use the `texts_from_folder` function to load and preprocess the data.  We assume that the data is in the following form:\n",
    "```\n",
    "    ├── datadir\n",
    "    │   ├── train\n",
    "    │   │   ├── class0       # folder containing documents of class 0\n",
    "    │   │   ├── class1       # folder containing documents of class 1\n",
    "    │   │   ├── class2       # folder containing documents of class 2\n",
    "    │   │   └── classN       # folder containing documents of class N\n",
    "```\n",
    "We set `val_pct` as 0.1, which will automatically sample 10% of the data for validation.  Since we will be using a pretrained BERT model for classification, we specifiy `preprocess_mode='bert'`.  If you are using any other model (e.g., `fasttext`), you should either omit this parameter or use `preprocess_mode='standard'`).\n",
    "\n",
    "**Notice that there is nothing speical or extra we need to do here for non-English text.**  *ktrain* automatically detects the language and character encoding and prepares the data and configures the model appropriately.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "detected encoding: GB18030\n",
      "Decoding with GB18030 failed 1st attempt - using GB18030 with skips\n",
      "skipped 109 lines (0.3%) due to character decoding errors\n",
      "skipped 9 lines (0.2%) due to character decoding errors\n",
      "preprocessing train...\n",
      "language: zh-cn\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "done."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "preprocessing test...\n",
      "language: zh-cn\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "done."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "trn, val, preproc = text.texts_from_folder('/home/amaiya/data/ChnSentiCorp_htl_ba_6000', \n",
    "                                            maxlen=75, \n",
    "                                            max_features=30000,\n",
    "                                            preprocess_mode='bert',\n",
    "                                            train_test_names=['train'],\n",
    "                                            val_pct=0.1,\n",
    "                                            classes=['pos', 'neg'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 2:  Create a Model and Wrap in Learner Object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Is Multi-Label? False\n",
      "maxlen is 75\n",
      "done.\n"
     ]
    }
   ],
   "source": [
    "model = text.text_classifier('bert', trn, preproc=preproc)\n",
    "learner = ktrain.get_learner(model, \n",
    "                             train_data=trn, \n",
    "                             val_data=val, \n",
    "                             batch_size=32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## STEP 3: Train the Model\n",
    "\n",
    "We will use the `fit_onecycle` method that employs a [1cycle learning rate policy](https://arxiv.org/pdf/1803.09820.pdf) for four epochs.  We will save the weights from each epoch using the `checkpoint_folder` argument, so that we can go reload the weights from the best epoch in case we overfit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "begin training using onecycle policy with max lr of 2e-05...\n",
      "Train on 5324 samples, validate on 592 samples\n",
      "Epoch 1/4\n",
      "5324/5324 [==============================] - 54s 10ms/step - loss: 0.3635 - acc: 0.8422 - val_loss: 0.2793 - val_acc: 0.8801\n",
      "Epoch 2/4\n",
      "5324/5324 [==============================] - 42s 8ms/step - loss: 0.2151 - acc: 0.9170 - val_loss: 0.2501 - val_acc: 0.9223\n",
      "Epoch 3/4\n",
      "5324/5324 [==============================] - 42s 8ms/step - loss: 0.1153 - acc: 0.9591 - val_loss: 0.2267 - val_acc: 0.9257\n",
      "Epoch 4/4\n",
      "5324/5324 [==============================] - 42s 8ms/step - loss: 0.0438 - acc: 0.9859 - val_loss: 0.2596 - val_acc: 0.9324\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7f4409157438>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "learner.fit_onecycle(2e-5, 4, checkpoint_folder='/tmp/saved_weights')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although Epoch 03 had the lowest validation loss, the final validation accuracy at the end of the last epoch is still the highest (i.e., **93.24%**), so we will just leave the model weights as they are this time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspecting Misclassifications"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "----------\n",
      "id:299 | loss:7.37 | true:neg | pred:pos)\n",
      "\n",
      "[CLS]酒店位置佳，出入西街比较方便；观景房是观西街的景，晚上虽然较吵但基本不会影响到睡眠；酒店对面就是在携程上预订自行车的九九车行，取车方便。通过携程订[SEP]\n"
     ]
    }
   ],
   "source": [
    "learner.view_top_losses(n=1, preproc=preproc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using Google Translate, the above roughly translates to:\n",
    "```\n",
    "\n",
    "Hotel location is good, access to West Street is more convenient; viewing room is the view of Guanxi Street, although the night is noisy, but it will not affect sleep; the opposite side of the hotel is the Jiujiu car line booking bicycles on Ctrip, easy to pick up. By Ctrip\n",
    "\n",
    "```\n",
    "\n",
    "Although there is a minor negative comment embedded in this review about noise, the review appears to be overall positive and was predicted as positive by our classifier. The ground-truth label, however, is negative, which may be a mistake and may explain the high loss.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Making Predictions on New Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = ktrain.get_predictor(learner.model, preproc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting label for the text\n",
    "> \"*The view and the service of this hotel were both quite terrible.*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'neg'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.predict(\"这家酒店的风景和服务都非常糟糕\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting label for:\n",
    "> \"*I like the service of this hotel.*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'pos'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.predict('我喜欢这家酒店的服务')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Save Predictor for Later Deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "p.save('/tmp/mypred')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = ktrain.load_predictor('/tmp/mypred')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "'pos'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# still works\n",
    "p.predict('我喜欢这家酒店的服务')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
